The Treegram Index: An Efficient Technique for Retrieval in Linguistic Treebanks
Under consideration for other conferences: ACL
Not registered
Abstract
In computational linguistics, large tree databases tagged with morpho-syntactic information require fast retrieval of multiway tree structures. To tackle this problem, we present treegram indexing, a generalization of the classical n-gram indexing technique. As an application of treegram indexing, we describe the Venona retrieval system, which handles the BHt treebank containing 508,650 phrase structure trees.

1 Tree Retrieval

Multiway trees (MT, henceforth) play a central role in representing complex linguistic information because they are a common and well-understood data structure for describing hierarchical information. With the availability of large treebanks, retrieval techniques for highly structured data have become essential. One of the best-known linguistic tree repositories is the Penn Treebank of the University of Pennsylvania. It is built on a corpus of 4.5 million words of American English; half of this corpus has been annotated for skeletal syntactic structure, cf. (Marcus et al., 1993).
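The abstract describes treegram indexing only as a generalization of n-gram indexing and does not define the index itself. As a rough illustration of the general filter-and-verify idea behind such indexes, the sketch below builds an inverted index over small tree fragments and uses it to prune candidates before an exact match. Everything in it is an assumption made for illustration: the (label, children) tree representation, the choice of parent-child label pairs as fragments, the ordered-embedding match, and all function names. It is not the paper's treegram definition and not the Venona implementation.

```python
from collections import defaultdict

# Assumed representation: a tree is (label, [child trees]).

def fragments(tree):
    """Yield a (parent_label, child_label) pair for every edge in the tree."""
    label, children = tree
    for child in children:
        yield (label, child[0])
        yield from fragments(child)

def build_index(trees):
    """Map each fragment to the set of ids of the trees that contain it."""
    index = defaultdict(set)
    for tree_id, tree in enumerate(trees):
        for frag in fragments(tree):
            index[frag].add(tree_id)
    return index

def candidates(index, query, n_trees):
    """Filter step: keep only trees that contain every fragment of the query."""
    result = set(range(n_trees))
    for frag in fragments(query):
        result &= index.get(frag, set())
    return result

def matches_at(node, query):
    """True if the query embeds at this node: equal labels, and the query's
    children match a subsequence of the node's children in left-to-right order."""
    if node[0] != query[0]:
        return False
    pos = 0
    for q_child in query[1]:
        while pos < len(node[1]) and not matches_at(node[1][pos], q_child):
            pos += 1
        if pos == len(node[1]):
            return False
        pos += 1
    return True

def contains(tree, query):
    """True if the query embeds at the root or at any descendant."""
    return matches_at(tree, query) or any(contains(c, query) for c in tree[1])

def search(trees, index, query):
    """Verify step: run the exact match only on the filtered candidates."""
    hits = candidates(index, query, len(trees))
    return sorted(i for i in hits if contains(trees[i], query))

trees = [
    ("S", [("NP", [("N", [])]), ("VP", [("V", []), ("NP", [("N", [])])])]),
    ("S", [("NP", [("Det", []), ("N", [])]), ("VP", [("V", [])])]),
    ("NP", [("Det", []), ("N", [])]),
]
index = build_index(trees)
print(search(trees, index, ("VP", [("V", []), ("NP", [])])))
print(search(trees, index, ("S", [("VP", []), ("NP", [])])))
```

In this toy example the second query passes the fragment filter for the first two trees but fails verification, because both contain the edges S-VP and S-NP only in the opposite order. That is why the exact match is still needed after the index lookup.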
Similar resources
The Treegram Index-An Efficient Technique for Retrieval in Linguistic Treebanks
Multiway trees (MT, henceforth) are a common and well-understood data structure for describing hierarchical linguistic information. With the availability of large treebanks, retrieval techniques for highly structured data now become essential. In this contribution, we investigate the efficient retrieval of MT structures at the cost of a complex index--the Treegram Index. We illustrate our appro...
Multiway-Tree Retrieval Based on Treegrams
Large tree databases as knowledge repositories are becoming more and more important; prominent examples are the treebanks in computational linguistics: text corpora consisting of up to five million words tagged with syntactic information. Consequently, these large amounts of structured data pose the problem of fast tree retrieval: Given a database T of labeled multiway trees and a query tree q, find...
Efficient Parsing for Bilexical Context-free Grammars and Head Automaton Grammars
Several recent stochastic parsers use bilexical grammars, where each word type idiosyncratically prefers particular complements with particular head words. We present O(n^4) parsing algorithms for two bilexical formalisms, improving the previous upper bounds of O(n^5). Also, for a common special ...
Computing Translation Units and Quantifying Parallelism in Parallel Dependency Treebanks
The linguistic quality of a parallel treebank depends crucially on the parallelism between the source and target language annotations. We propose a linguistic notion of translation units and a quantitative measure of parallelism for parallel dependency treebanks, and demonstrate how the proposed translation units and parallelism measure can be used to compute transfer rules, spot annotation err...
Efficient Probabilistic Top-down and Left-corner Parsing
This paper examines efficient predictive broad-coverage parsing without dynamic programming. In contrast to bottom-up methods, top-down parsing produces partial parses that are fully connected trees spanning the entire left context, from which any kind of non-local dependency or partial semantic interpretation can in principle be read. We contrast top-down and left-corner parsing, and find both to ...
Journal title:
Volume, issue:
Pages: -
Publication date: 1999